Topic/cuda aware communications by bosilca · Pull Request #671 · ICLDisco/parsec

bosilca · 2024-09-10T04:35:01Z

Add support for sending and receiving the data directly from and to devices. There are few caveats (noted on the commit log).

Note: because it includes the span renaming, this PR changes teh public API and will need to bump version to 5.x

The first question is how is such a device selected ?

The allocation of such a copy happen way before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy only based on some static
information, such as the task affinity. Therefore, this approach only
works for owner-compute type of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

Pass the correct data copy across the entire system, instead of
falling back to data copy of the device 0 (CPU memory)

TODOs

abouteiller · 2024-10-11T18:09:56Z

Now passing 1-gpu/node, 8 ranks PTG POTRF
Sorry I had to force-push there were issues with rebasing on master

G-Ragghianti · 2025-01-13T18:27:43Z

I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

abouteiller · 2025-01-13T21:12:41Z

Failure in stress (and similar in nvlink) due to the code generating a pushback event when transferring the last tile between the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly without a backing DC (from a NEW in MAKE_C).

See further discussion in #671 (comment)

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

~~Potential fix is to allocate a dev0copy like is done for the network received tiles, not sure why it doesn't already.~~

devreal · 2025-02-22T21:36:01Z

parsec/data.c

+             * from the data and eventually release their memory.
+             */
+            parsec_data_copy_detach(data, copy, copy->device_index);
+            zone_free((zone_malloc_t *)copy->arena_chunk, copy->device_private);


Has this been checked for per-tile allocated data? The release of that memory is different and should not go into the zone allocator.

devreal · 2025-02-22T22:00:11Z

parsec/mca/device/device_gpu.c

+                           gpu_device->super.device_index, gpu_device->super.name, original->key, (void*)gpu_copy->device_private);
        }
-        assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED) );
+        assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED));


I know this has been here before, but: how should we handle copies that are not owned by parsec? They still seem to end up in the LRU even if they are only managed, not owned.

abouteiller · 2025-02-24T22:11:59Z

I merged with master but apparently I missed a defect in erroneous cases printouts, that causes the CI failures.

parsec/data.c

abouteiller · 2025-02-26T19:13:56Z

parsec/mca/device/device_gpu.c

+            gpu_copy->coherency_state = PARSEC_DATA_COHERENCY_SHARED;
+            assert(PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status);
+            cpu_copy->data_transfer_status = PARSEC_DATA_STATUS_COMPLETE_TRANSFER;
+            if( 0 == (parsec_mpi_allow_gpu_memory_communications & PARSEC_RUNTIME_SEND_GPU_MEMORY) ) {


Followup to #671 (comment)
Issue with stress and friends is stemming from here: when we pushout and we have a successor task, the successor task must receive the cpu copy as input (otherwise it will reference a gpu_copy that is now in the read LRU).

A fix would need to distinguish between the case where we pushout to satisfy a communication and we pushout to satisfy the input of a CPU-only task.

if a pushout is requested then we should always pass back the CPU copy (after it was updated). I need to understand the case yo are describing here. Is there a simple reproducer I can play with ?

The reproducer is the stress CI tester

abouteiller · 2025-02-26T20:47:11Z

I merged with master but apparently I missed a defect in erroneous cases printouts, that causes the CI failures.

ci failure due to using data_in uninitialized in new context nc during ontask, introduced by changes to task_snprintf in 108b778

devreal · 2025-03-26T14:44:42Z

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

Name the data_t allocated for temporaries allowing developers to track them through the execution. Add the keys to all outputs (tasks and copies). Signed-off-by: George Bosilca <gbosilca@nvidia.com>

abouteiller · 2025-03-28T14:15:06Z

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

This warning is overcautious now that we have GPU-only copies created from the network. Ideally we would find a way to discriminate between the real leakages from the application and these temporaries being reclaimed.

bosilca · 2025-03-28T18:11:55Z

The original associated with these device owned copies should not have a valid dc ?

devreal · 2025-07-14T17:36:16Z

Here is what I think happens in the stress benchmark:

We allocate a new C tile, which has a host and device copy.
At the end of the GEMM task, the device copy gets passed through the reshape code and a reference is added to the device copy because that is the copy that is captured in the parsec_dep_data_description_t:

    data.data   = this_task->data._f_A.data_out;

We release the host copy at the end of release_deps_of_stress_GEMM. However, the host copy only has a single reference (the data) and so the host copy is detached from the parsec_data_t. Next time we use that parsec_data_t (in the next GEMM) we are missing the host copy.

I don't understand the reshape code and I was hoping to never have to touch it. I suspect that the reshape code was not designed with GPUs in mind but I could be wrong. Will need some help digging through this and figuring out

Whether capturing the device copy in reshape is correct.
How we can make sure that the host copy is not accidentally released.

bosilca · 2026-02-13T05:42:19Z

For completeness let me try to describe what the issue was. In fact @abouteiller pointed to it earlier in the comments, but it went unnoticed.

To clarify: in order to allow sending to remote successors directly from the GPU, this PR made a drastic change by passing to successors the data copy the task has worked on, instead of the CPU version. However, it did not allowed the two versions to drift apart, it updated the CPU version to mirror the device copy (which is great as the two are now equivalent). But by passing the device data copy it alters the way we track the reference count for copies (propagate the output copies to successors to trigger their ref count update before releasing the refcounts on the current task input). In most cases, it works just fine because the CPU copies has other references to it, at least because it belongs to a data collection.

But in this test we are using a NEW data copy, one that is not associated with any data collection but instead specifically created for the chain of GEMM task. The first time we create this data, we also create the CPU mirror for it, but that mirrored copy has a refcount of 1 because it is only referenced once (ir does not belong to a data collection). So, when the first GEMM task in the chain passes the GPU copy to its successor and releases it's own input copies, it releases the only reference to the CPU copy allowing the runtime to dispose of the copy. This is bad because now we have a valid device copy passing around tasks that does not have a backend CPU copy.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

This allows to check if the data can be send and received directly to and from GPU buffers. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

This is a multi-part patch that allows the CPU to prepare a data copy mapped onto a device. 1. The first question is how is such a device selected ? The allocation of such a copy happen way before the scheduler is invoked for a task, in fact before the task is even ready. Thus, we need to decide on the location of this copy only based on some static information, such as the task affinity. Therefore, this approach only works for owner-compute type of tasks, where the task will be executed on the device that owns the data used for the task affinity. 2. Pass the correct data copy across the entire system, instead of falling back to data copy of the device 0 (CPU memory) Add a configure option to enable GPU-aware communications. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Name the data_t allocated for temporaries allowing developers to track them through the execution. Add the keys to all outputs (tasks and copies). Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

copy if we are passed-in a GPU copy, and we need to retain/release the copies that we are swapping

readers

…ut-only flows, for which checking if they are control flows segfaults

accessor

Mismatch in printf type sizes might lead to segmentation faults. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Add missing dependency library. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

This disable the communications from GPU memory, but it is necessary for a proper tracking of reference counts on the data copies. At this point I dont think we can fiddle with the code to allow for device copy propagation, too many corner cases to be able to be workable. Instead, we should rethink the entire data copy framework, and allow tasks to fetch their inputs from devices as needed. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

I'm not sure I understand how this worked so far, because the temporary stack-based task we use to prepare the successors is barely initialized, not enough to call parsec_task_snprintf on it. The least info we need to set is to NULLify the data_in for all task classs flows. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

bosilca requested a review from a team as a code owner September 10, 2024 04:35

bosilca force-pushed the topic/cuda_aware_communications branch from 968bf7e to 6f2e034 Compare September 10, 2024 04:38

bosilca mentioned this pull request Sep 10, 2024

Add support for batched tasks and for CUDA-aware communications bosilca/parsec#4

Open

bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from b3dfcdc to 0838a95 Compare September 10, 2024 05:03

This comment was marked as resolved.

Sign in to view

abouteiller force-pushed the topic/cuda_aware_communications branch from efa8386 to ab1a74a Compare October 11, 2024 18:07

This comment was marked as resolved.

Sign in to view

abouteiller force-pushed the topic/cuda_aware_communications branch from cd7c475 to 3bab2d5 Compare October 16, 2024 20:43

therault mentioned this pull request Oct 24, 2024

C11 atomic lock alignment in data_t #685

Merged

abouteiller force-pushed the topic/cuda_aware_communications branch from eb5c782 to 3e0cb38 Compare October 31, 2024 14:53

devreal mentioned this pull request Nov 26, 2024

PaRSEC now allows DSLs to free the gpu task TESSEorg/ttg#307

Open

This was referenced Dec 5, 2024

Thread safe zone_malloc #712

Closed

Make zone-malloc/free thread safe #715

Merged

devreal mentioned this pull request Feb 4, 2025

Rework the parsec_object_t system. #726

Closed

abouteiller linked an issue Feb 12, 2025 that may be closed by this pull request

Direct GPU-GPU inter-node data transfer #661

Open

8 tasks

devreal reviewed Feb 22, 2025

View reviewed changes

This comment was marked as resolved.

Sign in to view

devreal reviewed Feb 25, 2025

View reviewed changes

parsec/data.c Outdated Show resolved Hide resolved

devreal mentioned this pull request Feb 25, 2025

Provide mechanism to discard data #695

Open

abouteiller reviewed Feb 26, 2025

View reviewed changes

This comment was marked as resolved.

Sign in to view

bosilca force-pushed the topic/cuda_aware_communications branch from 5555bed to 76bc47f Compare February 13, 2026 05:19

bosilca force-pushed the topic/cuda_aware_communications branch from 76bc47f to c1a0455 Compare February 13, 2026 06:49

bosilca and others added 16 commits February 13, 2026 01:52

Allow JDF with no dependencies, no datatype and no arenas.

bd8c4cc

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Add a CUDA-based RTT test.

3e9e84d

This allows to check if the data can be send and received directly to and from GPU buffers. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Mostly improvement to the debuging output.

0505b25

Name the data_t allocated for temporaries allowing developers to track them through the execution. Add the keys to all outputs (tasks and copies). Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Correctly initialize an unlock atomic lock.

9f0c172

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

gpu-datain: when we are executing a CPU body we need to retrieve the CPU

1468d6a

copy if we are passed-in a GPU copy, and we need to retain/release the copies that we are swapping

gpu-datain: pure distribution collections have no data_of/data_of_key

f3c9a0d

Use lock initializers instead of static temps

30742bc

gpu-datain: when the RO data-in is from D2D we need to increase its

d49fe0e

readers

gpu-datain: add a configure option and change some namings

7c25351

When managing CPU-only tasks that received a GPU data copy, skip outp…

320d62b

…ut-only flows, for which checking if they are control flows segfaults

mpigpu: fix a number of crashes in release_self_contained_data

87ca1b2

testing for dev_index==0 is not always true, use the mca_device_is_gpu

7d2941a

accessor

Use PRIx64 when printing a parsec_data_key_t

5edcf60

Mismatch in printf type sizes might lead to segmentation faults. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Add support for CUDA 13.

237b9d2

Signed-off-by: George Bosilca <gbosilca@nvidia.com>

Allow command line arguments for the stress test.

45e0618

Add missing dependency library. Signed-off-by: George Bosilca <gbosilca@nvidia.com>

bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from 0a5fdde to f0dd212 Compare February 13, 2026 07:11

bosilca added 2 commits February 13, 2026 02:15

bosilca force-pushed the topic/cuda_aware_communications branch from f0dd212 to ae5581f Compare February 13, 2026 07:15

Conversation

bosilca commented Sep 10, 2024 • edited by devreal Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

abouteiller commented Oct 11, 2024

Uh oh!

This comment was marked as resolved.

G-Ragghianti commented Jan 13, 2025

Uh oh!

abouteiller commented Jan 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devreal Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

devreal Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

abouteiller commented Feb 24, 2025

Uh oh!

Uh oh!

abouteiller Feb 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bosilca Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

abouteiller Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

abouteiller commented Feb 26, 2025

Uh oh!

This comment was marked as resolved.

devreal commented Mar 26, 2025

Uh oh!

This comment was marked as resolved.

abouteiller commented Mar 28, 2025

Uh oh!

bosilca commented Mar 28, 2025

Uh oh!

devreal commented Jul 14, 2025

Uh oh!

bosilca commented Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

bosilca commented Sep 10, 2024 •

edited by devreal

Loading

abouteiller commented Jan 13, 2025 •

edited

Loading

abouteiller Feb 26, 2025 •

edited

Loading